ABanditNAS: Anti-Bandit for Neural Architecture Search
[Figure: candidate operations (3×3 conv, 5×5 conv, 3×3 max pool, 3×3 depth-wise conv, identity) between blocks $B_i$ and $B_j$. In the "Sampling operations" panel, operations are sampled with the anti-bandit LCB, $s(o_k^{(i,j)}) = m_{k,t}^{(i,j)} - \sqrt{2\log N / n_{k,t}^{(i,j)}}$; in the "Reducing the search space" panel, operations are abandoned with the anti-bandit UCB, $s(o_k^{(i,j)}) = m_{k,t}^{(i,j)} + \sqrt{2\log N / n_{k,t}^{(i,j)}}$.]
FIGURE 4.1 ABanditNAS is divided into two steps: sampling using LCB and abandoning using UCB.
they are confirmed to be bad. Meanwhile, once well trained, weight-free operations are compared only with parameterized operations. In addition, the operation pruning process makes the search space progressively smaller, leading to an efficient search process.
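To make the effect of pruning concrete, here is a back-of-the-envelope sketch (the numbers of edges and operations are hypothetical, not figures from the source): removing one candidate operation per edge in each round shrinks the discrete search space geometrically.

```python
# Hypothetical cell: E edges, each with K candidate operations.
# The discrete search space contains K**E architectures; abandoning one
# operation per edge in each pruning round shrinks it geometrically.
E, K = 14, 8
sizes = [(K - t) ** E for t in range(K)]
print(f"{sizes[0]:.2e} -> {sizes[-1]}")  # ~4.40e+12 -> 1
```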
4.2.1 Anti-Bandit Algorithm
Our goal is to search for network architectures effectively and efficiently. However, NAS faces a dilemma: whether to keep a network structure that already offers significant rewards (exploitation) or to investigate other network structures further (exploration). Based on probability theory, the multi-armed bandit can solve this exploration-versus-exploitation dilemma by making decisions among competing choices so as to maximize the expected gain. Specifically, we propose an anti-bandit that chooses and discards arm $k$ in a trial based on
$$\tilde{r}_k - \tilde{\delta}_k \leq r_k \leq \tilde{r}_k + \tilde{\delta}_k, \qquad (4.1)$$
where $r_k$, $\tilde{r}_k$, and $\tilde{\delta}_k$ are the true reward, the average reward, and the estimated variance obtained from arm $k$. $\tilde{r}_k$ is the value term that favors actions that have historically performed well, and $\tilde{\delta}_k$ is the exploration term that gives actions an exploration bonus. $\tilde{r}_k - \tilde{\delta}_k$ and $\tilde{r}_k + \tilde{\delta}_k$ can be interpreted as the lower and upper bounds of a confidence interval.
The traditional UCB algorithm, which optimistically substitutes $\tilde{r}_k + \tilde{\delta}_k$ for $r_k$, emphasizes exploration but ignores exploitation. Unlike the UCB bandit, we exploit both the LCB and the UCB to balance exploration and exploitation. An arm with a small LCB usually has a low expected reward but a large variance, and should therefore be given a greater chance of being sampled for further trials. Then, based on the observation that the operations that are worst in the early stage usually also perform worst at the end [291], we use the UCB to prune the worst-performing operation and reduce the search space. In summary, we adopt the LCB, $\tilde{r}_k - \tilde{\delta}_k$, to sample the arm that should be optimized further, and use the UCB, $\tilde{r}_k + \tilde{\delta}_k$, to abandon the operation with the minimum value. Because the variance is bounded and converges, the estimated value of an operation stays close to its true value and gradually approaches it as the number of trials increases. Our anti-bandit algorithm overcomes the limitations of a purely exploration-based strategy, including its required level of understanding and its suboptimality gaps. The definitions of the value term and the variance term, together with the proof of our proposed method, are given below.
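As a concrete illustration, the following Python sketch implements this sample-with-LCB, abandon-with-UCB loop for a single edge, using the $\sqrt{2\log N / n_k}$ bonus from Figure 4.1 as the exploration term and a random number as a stand-in for the validation reward. The class and variable names are ours for illustration, not the authors' reference implementation.

```python
import math
import random

class AntiBandit:
    """Minimal anti-bandit sketch for one edge with K candidate operations."""

    def __init__(self, num_ops):
        self.counts = [1] * num_ops   # n_k, initialized to 1 to avoid division by zero
        self.means = [0.0] * num_ops  # value term r~_k: average reward of arm k
        self.total = num_ops          # N: total number of trials so far

    def _bonus(self, k):
        # Exploration term sqrt(2 log N / n_k), as in Figure 4.1.
        return math.sqrt(2.0 * math.log(self.total) / self.counts[k])

    def sample(self, active):
        # LCB sampling: the smallest lower bound marks a poorly estimated,
        # high-variance arm that deserves more trials.
        return min(active, key=lambda k: self.means[k] - self._bonus(k))

    def update(self, k, reward):
        # Incremental form of the value term in Eq. (4.2).
        self.counts[k] += 1
        self.total += 1
        self.means[k] += (reward - self.means[k]) / self.counts[k]

    def abandon(self, active):
        # UCB pruning: drop the arm whose *optimistic* estimate is still
        # the worst; even its upper confidence bound is unpromising.
        active.remove(min(active, key=lambda k: self.means[k] + self._bonus(k)))

bandit = AntiBandit(num_ops=8)
active = list(range(8))
while len(active) > 1:
    for _ in range(20):                           # trials per pruning round
        k = bandit.sample(active)
        bandit.update(k, reward=random.random())  # stand-in for validation accuracy
    bandit.abandon(active)                        # search space shrinks by one op
```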
Definition 1. If the operation on arm $k$ has been recommended $n_k$ times and $\mathrm{reward}_i$ is the reward obtained from arm $k$ in the $i$-th trial, the value term of the anti-bandit is defined as
$$\tilde{r}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} \mathrm{reward}_i. \qquad (4.2)$$
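For instance, with hypothetical rewards, Eq. (4.2) is just the empirical mean of the rewards observed on arm $k$, which the sketch above maintains incrementally:

```python
# Hypothetical rewards for arm k over n_k = 4 trials.
rewards = [0.62, 0.70, 0.68, 0.74]
r_tilde = sum(rewards) / len(rewards)  # r~_k = 0.685
```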